Exploratory Analysis
So you come into work on a Monday morning and find that your boss has tasked you with pulling some insights out of a current data set. You open the Excel file and see countless rows and columns filled with data. What do you do next?
Well, because you are a trendy analyst or someone just looking to learn new skills, you decide to use Python.
In this short blog, I will teach you how to begin looking for insights in your data, or, in other words, how to do exploratory analysis. We will use Python 3 and a few of its libraries to start our journey.
First off, you are going to import the Pandas, Matplotlib, and Seaborn libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
These are just a few of the libraries available in the Python ecosystem, and in our case, they will be the main libraries used for exploratory analysis.
Next, you will load your boss's Excel file, which you have saved as a CSV file.
example_1 = pd.read_csv('~/Documents/projects/project_1/data/act_2018.csv')
Now that you have loaded your file into your notebook or text editor, you can begin the initial analysis with some Pandas functions.
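Before going further, a quick optional sanity check, just a minimal sketch: the .shape attribute confirms how many rows and columns actually came in.
# Optional sanity check: number of rows and columns loaded
example_1.shape    # returns a (rows, columns) tuple, here (52, 7)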
Run the .head() function to show the first five rows and all of the column names; seeing what each column is called will help when you start plotting later.
example_1.head()
|    | State    | Participation | English | Math | Reading | Science | Composite |
|----|----------|---------------|---------|------|---------|---------|-----------|
| 0  | National | 55            | 20.2    | 20.5 | 21.3    | 20.7    | 20.8      |
| 1  | Alabama  | 100           | 18.9    | 18.3 | 19.6    | 19.0    | 19.1      |
| 2  | Alaska   | 33            | 19.8    | 20.6 | 21.6    | 20.7    | 20.8      |
| 3  | Arizona  | 66            | 18.2    | 19.4 | 19.5    | 19.2    | 19.2      |
| 4  | Arkansas | 100           | 19.1    | 18.9 | 19.7    | 19.4    | 19.4      |
Next, run the .tail() function to show the bottom of your DataFrame.
example_1.tail()
|    | State         | Participation | English | Math | Reading | Science | Composite |
|----|---------------|---------------|---------|------|---------|---------|-----------|
| 47 | Virginia      | 24            | 23.8    | 23.3 | 24.7    | 23.5    | 23.9      |
| 48 | Washington    | 24            | 21.4    | 22.2 | 22.7    | 22.0    | 22.2      |
| 49 | West Virginia | 65            | 19.8    | 19.4 | 21.3    | 20.4    | 20.3      |
| 50 | Wisconsin     | 100           | 19.8    | 20.3 | 20.6    | 20.8    | 20.5      |
| 51 | Wyoming       | 100           | 19.0    | 19.7 | 20.6    | 20.3    | 20.0      |
Now that you have a basic understanding of your data, you should check whether there are any null values in your columns and what type of data each column holds.
example_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 7 columns):
State 52 non-null object
Participation 52 non-null int64
English 52 non-null float64
Math 52 non-null float64
Reading 52 non-null float64
Science 52 non-null float64
Composite 52 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 3.0+ KB
In this data set, you can see that there are no null values and that the columns hold a mixture of object, integer, and float data types.
Next, you can use the .describe() function (transposed with .T for readability) to see things like the minimum and maximum values in each column, as well as the standard deviation and the mean.
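If you prefer an explicit count of missing values per column, rather than reading them off the .info() output, a one-line check with .isnull().sum() gives the same answer:
# Count missing values in each column; every count should be 0 for this data set
example_1.isnull().sum()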
example_1.describe().T
|               | count | mean      | std       | min  | 25%    | 50%   | 75%     | max   |
|---------------|-------|-----------|-----------|------|--------|-------|---------|-------|
| Participation | 52.0  | 61.519231 | 33.757782 | 7.0  | 29.250 | 65.50 | 100.000 | 100.0 |
| English       | 52.0  | 20.973077 | 2.424719  | 16.6 | 19.100 | 20.20 | 23.700  | 26.0  |
| Math          | 52.0  | 21.113462 | 2.017573  | 17.8 | 19.400 | 20.65 | 23.125  | 25.2  |
| Reading       | 52.0  | 22.001923 | 2.148186  | 18.0 | 20.475 | 21.45 | 24.050  | 26.1  |
| Science       | 52.0  | 21.332692 | 1.853848  | 17.9 | 19.925 | 20.95 | 23.025  | 24.9  |
| Composite     | 52.0  | 21.473077 | 2.087696  | 17.7 | 19.975 | 21.05 | 23.525  | 25.6  |
With just a few simple functions from the Pandas library, we have gained a significant understanding of our numerical data.
And now it's time to bring in the next library, Matplotlib, which gives us the ability to start plotting some data.
Let's get started by plotting each score column as a simple histogram. This may provide some insight into our data, such as whether or not it is normally distributed.
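Another quick, optional check at this stage is sorting by a column to see which rows sit at the extremes; here is a minimal sketch using the Composite column, though any column of interest works the same way:
# Which rows have the highest and lowest Composite scores?
example_1.sort_values('Composite', ascending=False).head()
example_1.sort_values('Composite', ascending=True).head()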
plt.figure(figsize=(4,4))
plt.hist(example_1['English']);
plt.figure(figsize=(4,4))
plt.hist(example_1['Math']);
plt.figure(figsize=(4,4))
plt.hist(example_1['Reading']);
plt.figure(figsize=(4,4))
plt.hist(example_1['Science']);
plt.figure(figsize=(4,4))
plt.hist(example_1['Composite']);
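If the copy-and-paste above feels repetitive, the same five plots can be produced with a short loop over the score columns; this is just an equivalent sketch, not a requirement:
# Equivalent to the five calls above: one small histogram per score column
for col in ['English', 'Math', 'Reading', 'Science', 'Composite']:
    plt.figure(figsize=(4,4))
    plt.hist(example_1[col])
    plt.title(col)
    plt.show()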
You can use Matplotlib to plot a few different styles of graphs quickly, but for exploratory analysis, I like to keep things simple with either histograms or scatterplots because they visualize the data efficiently.
Remember, when doing exploratory analysis, you don't need to make things pretty; you need to make things easy for you to read and understand.
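As an example of a quick scatterplot, you could put Participation against the Composite score, which hints at the relationship the heatmap below will confirm:
# Quick look at Participation vs. Composite score
plt.figure(figsize=(4,4))
plt.scatter(example_1['Participation'], example_1['Composite'])
plt.xlabel('Participation')
plt.ylabel('Composite')
plt.show()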
The last visual I like to use is a heatmap; it shows the correlation among our numerical columns cleanly and effectively. This is where we use Seaborn.
fig, ax = plt.subplots(figsize=(6,6))
sns.heatmap(example_1.corr(), annot=True, cmap="icefire");
You can see the value of the heatmap right away: the human eye picks up color differences quickly and effectively. Using the "icefire" colormap, you can see that Participation is negatively correlated with every one of the scores provided by the ACT, a powerful insight that you might not have noticed without the heatmap.
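If you want the exact numbers behind the colors, you can read them straight off the correlation matrix; a minimal sketch pulling out the Participation row is below (on recent Pandas versions you may need the numeric_only=True flag here, and in the heatmap call above, so the non-numeric State column is skipped):
# Numeric correlations between Participation and the score columns
example_1.corr(numeric_only=True)['Participation'].sort_values()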
And that's it. In less than 20 minutes, Python expert or not, you can quickly explore your data and see what trends and interpretations emerge. You can now report back to your boss with what you are seeing, or, if you'd like to go further, check back soon for the next blog!